Quality Assessment of General and Categorized Arabic Text Corpora

Authors

  • Ali Salhi
  • Adnan Yahya
Abstract

Many Natural Language Processing and Information Retrieval methods rely on extensive use of text corpora, and the credibility of their results can be heavily influenced by the quality of the underlying corpus. Much research has applied Arabic corpora to various tasks in Arabic Information Processing. In this paper we discuss a suite of metrics that can be used to ascertain the quality of Arabic corpora. We borrow heavily from the extensive work on corpus quality for other languages and adapt some of those measures to Arabic. We also apply these measures to sample corpora, including categorized corpora, and report the results. The main corpora we experiment with are a general corpus extracted from newspaper articles (AlQuds newspaper) covering the years 2009-2012 and a highly categorized corpus (split into 9 major and 25 minor categories) built from Arabic Wikipedia. We employ different filtration methods and examine different corpus features as quality metrics. The metrics are based on parameters such as character and word N-gram frequencies and the applicability of Zipf's law. We also study error rates, vocabulary properties, monolinguality, the effects of normalization, and corpus stability as size grows. We intend to make our corpus quality assessment tools available online for use by other researchers.

Keywords: Arabic Text Corpora, Corpus Quality Measures, Arabic Information Retrieval, Zipf's Law, Statistical Analysis of Text, Arabic NLP
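As an illustration of two of the metrics named in the abstract (N-gram frequencies and Zipf's law applicability), the following is a minimal Python sketch and not the authors' tooling. The function names (word_frequencies, char_ngrams, zipf_slope) and the tokenization rule are assumptions made for this example. It assumes undiacritized text and treats any run of characters in the basic Arabic letter range as a word. A log-log rank-frequency slope close to -1 is the usual informal sign that a corpus follows Zipf's law.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count word tokens, where a token is a run of basic Arabic letters.

    Assumption: U+0621-U+064A is used as the letter range; diacritics and
    non-Arabic characters act as separators in this simplified tokenizer.
    """
    tokens = re.findall(r"[\u0621-\u064A]+", text)
    return Counter(tokens)


def char_ngrams(text, n):
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks start at 1 for the most frequent item. A value near -1 suggests
    the distribution is close to the classical Zipf form.
    """
    counts = sorted(freqs.values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct items to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

In practice one would call word_frequencies on the full corpus text and pass the result to zipf_slope. Comparing the slope before and after a normalization or filtration step is one simple way to observe how such steps change the distribution.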

Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

The Quality Assessment of the Persian Translation of “The Graveyard” Based on House's Translation Quality Assessment Model

The notion of Translation Quality Assessment (TQA) has been the object of many studies. TQA, as the comprehensive treatment of translation evaluation, focuses on the relationship between the source text and the target text and also emphasizes that translation is a linguistic operation. To accomplish this research, House's model of TQA was chosen as the framework for the investigation ...

Assessing the Quality of Persian Translation of Orwell’s Nineteen Eighty-Four Based on House’s Model: Overt-Covert Translation Distinction

This study aimed to assess the quality of Persian translation of Orwell's (1949) Nineteen Eighty-Four by Balooch (2004) based on House's (1997) model of translation quality assessment. To do so, 23 pages (about 10 percent) of the source text were randomly selected. The profile of the source text register was produced and the genre was realized. The source text profile was compared to t...

Towards a Measure for Arabic Corpora Quality

In this paper we present a statistical measure that is used for the first time to evaluate the quality of Arabic corpora. This measure is based entirely on statistical data and is language-independent. However, the values obtained in experiments could be very different for corpora written in different languages. Our experiments were conducted using Arabic corpora. We have chosen...

Publication date: 2014